What's inside the news?

Franziska Löw

December 16, 2017

Agenda

  1. Introduction

  2. Methodology

  3. Literature Review

  4. First Results

  5. Conclusion

Introduction

Online News

Business Model of Online News

Research Question:

Does the business model have an effect on the editorial content?

Methodology:

  1. Estimate a Structural Topic Model
  2. Use the posterior distribution to estimate the effect of document metadata.

Data

Online news articles about domestic politics, published from June 1 to November 22, 2017

Concepts

  • A single observation in a textual database is called a document.

  • The set of documents that make up the dataset is called a corpus.

  • Covariates associated with each document are called metadata.

Data Structure

  • Documents (articles) are stored in a “Document-Term Matrix”:

           afd beruf cdu funktion helmut jahre kohl merkel prozent schulz
    4214     2     0  13        0      0     4    0    106      13     62
    5412   137    69  22       69      0    78    0      3       3      3
    5442   134     0  71        0      0     4    0     59      88     32
    582      0     0  19        0    144    28  235      3       1      0
    6282   206    94  27       94      0   104    0      5       2      3

  • Documents are treated as a “bag of words”.
  • Each article has metadata: the publisher (news platform) and the day it was published.
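A document-term matrix like the one above can be built in a few lines. A minimal Python sketch using only the standard library; the two toy documents and their IDs are made up for illustration:

```python
from collections import Counter

# Toy corpus: document id -> text (made-up examples)
docs = {
    4214: "merkel merkel cdu schulz",
    5412: "afd beruf cdu jahre",
}

# Vocabulary = union of all terms, sorted for a stable column order
vocab = sorted({w for text in docs.values() for w in text.split()})

# Document-term matrix: one row of term counts per document
dtm = {doc_id: [Counter(text.split())[term] for term in vocab]
       for doc_id, text in docs.items()}

print(vocab)
print(dtm[4214])
```

Real pipelines use a sparse representation for this matrix, since most entries are zero.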

How can we uncover the latent topics in an article?

Methodology

Topic Model

Credits: Christine Doig

The intuition behind LDA

Credits: Blei (2012)

  • Model the generation of documents with a latent topic structure:
    • a topic ~ a distribution over words
    • a document ~ a mixture over topics
    • a word ~ a sample drawn from one topic

LDA as a graphical model

  • Nodes are random variables; arrows indicate dependence

  • Plates indicate replicated variables:
    • \(N =\) collection of words within a document.
    • \(D =\) collection of documents within a corpus.
  • Shaded nodes are observed; unshaded nodes are hidden
    • observed: word in a document \(w_{d,n}\)
    • fixed: mixture components (number of topics \(K\) & vocabulary)
    • hidden: mixture proportions (per-document topic proportions \(\theta_d\) & word-topic distribution \(\beta_k\))

Generative Process I

  • Generative process for the \(n^{th}\) word-position in document \(d\):
    1. \(K\): choose the number of topics;
    2. \(\theta_d\): for each document \(d\), choose a distribution over topics;
    3. \(z_{d,n}\): according to \(\theta_d\), assign a topic to the \(n^{th}\) word;
    4. \(w_{d,n}\): choose a term from that topic according to \(\beta_k\);
    5. \(N\): repeat this process for all \(n\) word-positions in the document;
    6. \(D\): conduct this process for all \(d\) documents in the corpus.
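The steps above can be simulated directly. A minimal numpy sketch; the values of \(K\), \(V\), \(D\), \(N\) and the hyperparameters are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D, N = 3, 8, 5, 20     # topics, vocabulary size, documents, words per doc
alpha, eta = 0.5, 0.1        # Dirichlet hyperparameters

# One word distribution beta_k per topic, fixed for the whole corpus
beta = rng.dirichlet(np.full(V, eta), size=K)

corpus = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))  # step 2: topic proportions
    doc = []
    for n in range(N):
        z = rng.choice(K, p=theta_d)            # step 3: topic assignment
        w = rng.choice(V, p=beta[z])            # step 4: draw a term
        doc.append(w)
    corpus.append(doc)                          # steps 5-6: repeat for all n, d

print(len(corpus), len(corpus[0]))
```

Running the simulation forward like this is exactly the generative story that inference later inverts.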


Generative Process II

\(p(\beta_{1:K},\theta_{1:D},z_{1:D},w_{1:D})=\displaystyle \prod_{i=1}^{K}p(\beta_i)\displaystyle \prod_{d=1}^{D}p(\theta_d)\left(\prod_{n=1}^N p(z_{d,n}\,|\,\theta_d)\,p(w_{d,n}\,|\,\beta_{1:K},z_{d,n})\right)\)

library(MCMCpack)  # provides rdirichlet()

# 200 draws from a symmetric Dirichlet(1, 1, 1) prior on the 3-simplex
draws <- rdirichlet(200, c(1, 1, 1))
bivt.contour(draws)  # contour plot of the draws

Bayesian inference

Posterior probability: \(P(A|B)=\frac{P(B|A)P(A)}{P(B)}\)

\(p(\theta_{1:D},z_{1:D},\beta_{1:K}\,|\,w_{1:D}) = \frac{p(\theta_{1:D},z_{1:D},\beta_{1:K},w_{1:D})}{p(w_{1:D})}\)

  • numerator: joint distribution of all random variables (can be easily computed for any setting of hidden variables.)
  • denominator: marginal probability of the observations (probability of seeing the observed corpus under any topic model.)

  • Then use posterior expectations to perform the task at hand: information retrieval, document similarity, exploration, and others.
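The structure of Bayes' rule above can be checked on a toy example. A minimal Python sketch; all probabilities are made up for illustration:

```python
# Made-up numbers: P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.2
p_a = 0.3
p_b_given_a = 0.8
p_b_given_not_a = 0.2

# Denominator: marginal probability of B, summing over both settings of A
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior: P(A|B) = P(B|A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # 0.6316
```

In the topic-model posterior the same denominator becomes \(p(w_{1:D})\), a sum over every possible topic assignment, which is why it is intractable and must be approximated.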

Research Process

Structural Topic Model

  • We want to use estimates of \(\theta_d\) as the dependent variable in a regression on covariates to test whether different types of documents have different content.

  • This is contradictory because documents are assumed to be generated by the same statistical process.

  • The structural topic model (STM) of Roberts et al. (2016) explicitly introduces covariates into a topic model and allows one to estimate the impact of document-level covariates on topic content and prevalence as part of the topic model itself.
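The naive two-stage approach described in the first bullet can be sketched as follows. A minimal numpy example with synthetic data: the \(\theta\) estimates and the 0/1 publisher dummy are simulated, not estimated from the corpus, and STM replaces this two-stage procedure with a single joint model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for estimated per-document proportions of one topic,
# with a made-up 0/1 publisher covariate shifting prevalence by 0.1
publisher = rng.integers(0, 2, size=100)
theta_k = 0.2 + 0.1 * publisher + rng.normal(0, 0.02, size=100)

# Stage 2: OLS of theta_k on the covariate, theta_k = b0 + b1 * publisher
X = np.column_stack([np.ones(100), publisher])
b, *_ = np.linalg.lstsq(X, theta_k, rcond=None)
print(b)  # b[1] estimates the publisher effect on topic prevalence
```

The regression treats the estimated proportions as observed data and ignores their estimation uncertainty, which is the inconsistency the STM is designed to avoid.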

Topic Prevalence vs. Content

  • The process for generating individual words is the same as for plain LDA conditional on the \(\beta_k\) and \(\pi_d\) terms.

  • However, both objects can depend on potentially different sets of document-level covariates. Each document has:
    1. Topic Prevalence: attributes that affect the likelihood of discussing topic \(k\)
    2. Topic Content: attributes that affect the likelihood of including term \(v\) overall, and of including it within topic \(k\)
  • The generation of the \(\beta_k\) and \(\pi_d\) terms is via multinomial logistic regression, which breaks local conjugacy.

Model Selection

  • There are three parameters we need to make assumptions about: the number of topics \(K\) and the priors \(\alpha\) and \(\eta\).
  • The priors receive little attention in the literature:
    • Griffiths & Steyvers (2002) recommend \(\eta = 200/V\) and \(\alpha = 50/K\).
    • Smaller values tend to generate more concentrated distributions (see also Wallach et al. (2009)).
  • The choice of \(K\) is clearer. Two potential goals:
    1. Predict text well: statistical criteria to select \(K\).
    2. Interpretability: general versus specific topics.
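The concentration effect of the prior is easy to verify empirically: smaller Dirichlet parameters pile probability mass on a few components. A minimal numpy sketch with arbitrary toy values of \(K\) and \(\alpha\):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10

# Many draws from symmetric Dirichlet priors with small vs. large parameter
small = rng.dirichlet(np.full(K, 0.1), size=2000)
large = rng.dirichlet(np.full(K, 10.0), size=2000)

# Concentration measured by the average largest component of each draw
print(small.max(axis=1).mean())  # near 1: mass concentrated on few topics
print(large.max(axis=1).mean())  # near 1/K: mass spread evenly
```

With \(\alpha\) well below 1 most documents end up dominated by one or two topics; large \(\alpha\) pushes every document toward an even topic mixture.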

Approximate Posterior

  • Mean field variational methods (Blei et al., 2001, 2003)
  • Expectation propagation (Minka and Lafferty, 2002)
  • Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
  • Distributed sampling (Newman et al., 2008; Ahmed et al., 2012)
  • Stochastic inference (Hoffman et al., 2010, 2013; Mimno et al., 2012)
  • Factorization inference (Arora et al., 2012; Anandkumar et al., 2012)
  • Amortized inference (Srivastava and Sutton, 2019)
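As one concrete instance from the list above, collapsed Gibbs sampling resamples each word's topic from its conditional given all other assignments. A minimal sketch on a made-up six-word vocabulary, with no convergence diagnostics:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 2, 6
alpha, eta = 0.5, 0.1
docs = [[0, 1, 2, 0], [3, 4, 5, 3], [0, 2, 1, 1]]  # word ids (toy corpus)

# Count tables: topic-term counts, document-topic counts, topic totals
n_kv = np.zeros((K, V))
n_dk = np.zeros((len(docs), K))
n_k = np.zeros(K)

# Random initial topic assignment z_{d,n}, recorded in the count tables
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        k = z[d][n]
        n_kv[k, w] += 1; n_dk[d, k] += 1; n_k[k] += 1

for it in range(200):                       # Gibbs sweeps
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]                     # remove the current assignment
            n_kv[k, w] -= 1; n_dk[d, k] -= 1; n_k[k] -= 1
            # Collapsed conditional p(z_{d,n} = k | z_{-d,n}, w)
            p = (n_kv[:, w] + eta) / (n_k + V * eta) * (n_dk[d] + alpha)
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][n] = k                     # add back with the new topic
            n_kv[k, w] += 1; n_dk[d, k] += 1; n_k[k] += 1

# Point estimate of per-document topic proportions from the final counts
theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
print(theta.round(2))
```

The count tables are all the sampler needs because \(\theta\) and \(\beta\) have been integrated out; they are recovered afterwards from the counts, as on the next slide.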

Predictive Distributions

\(\beta_{k,v} = \frac{m_{k,v}+\eta}{\sum_{v=1}^V(m_{k,v}+\eta)}\)

\(\theta_{d,k} = \frac{n_{d,k}+\alpha}{\sum_{k=1}^K(n_{d,k}+\alpha)}\)
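Given the count tables from a fitted sampler, each predictive distribution is a one-line smoothed normalization. A minimal numpy sketch with made-up sufficient statistics:

```python
import numpy as np

alpha, eta = 0.5, 0.1

# Made-up counts: m[k, v] = occurrences of term v assigned to topic k,
# n[d, k] = words in document d assigned to topic k
m = np.array([[5, 0, 2], [1, 4, 0]], dtype=float)  # K=2 topics, V=3 terms
n = np.array([[6, 2], [1, 7]], dtype=float)        # D=2 docs,  K=2 topics

# beta_{k,v}  = (m_{k,v} + eta)   / sum_v (m_{k,v} + eta)
beta = (m + eta) / (m + eta).sum(axis=1, keepdims=True)
# theta_{d,k} = (n_{d,k} + alpha) / sum_k (n_{d,k} + alpha)
theta = (n + alpha) / (n + alpha).sum(axis=1, keepdims=True)

print(beta.round(3))
print(theta.round(3))
```

The \(\eta\) and \(\alpha\) terms act as pseudo-counts, so no term or topic gets probability exactly zero.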

Formalizing Interpretability

  • Chang et al. (2009) propose an objective way of determining whether topics are indeed interpretable.
  • Two tests:
  1. Word intrusion:
    • Form a set of words consisting of the top five words from topic \(k\) plus a word with low probability in topic \(k\).
    • Ask subjects to identify the inserted word.
  2. Topic intrusion:
    • Show subjects a snippet of a document, the top three topics associated with it, and a randomly drawn other topic.
    • Ask them to identify the inserted topic.
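Constructing a word-intrusion item from a fitted topic can be sketched as follows. The vocabulary and word probabilities below are made up for illustration; a politics-heavy topic gets an intruder drawn from its lowest-probability words:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["merkel", "cdu", "schulz", "koalition", "wahl", "fussball", "tor"]
# Made-up word probabilities for one topic (sums to 1)
topic = np.array([0.30, 0.25, 0.20, 0.15, 0.08, 0.01, 0.01])

# Top five words of the topic...
top5 = [vocab[i] for i in np.argsort(topic)[::-1][:5]]
# ...plus one low-probability intruder word
intruder = vocab[int(np.argmin(topic))]
item = top5 + [intruder]
rng.shuffle(item)  # shown to subjects in random order

print(sorted(item))
```

If subjects reliably spot the intruder, the topic's top words form a semantically coherent set.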

Model Results

Topic Proportions

Sample Articles